R Markdown file

R Markdown weaves together narrative text and code (see the code chunk below), improving the readability and reproducibility of your code. For more details on using R Markdown, refer to this quick tour or this website. For R Markdown syntax, refer to this cheat sheet.

Note: you may need to install the rmarkdown package if you have not done so already. Use install.packages("rmarkdown").

This is an R Notebook file (see the YAML header: output: html_notebook). An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input or in the console. This makes an R Notebook a good choice while drafting an R Markdown document. When you save this document, an html file should be generated automatically. When you are ready to publish, you can render the document to a publication format (e.g., HTML, Word, PDF) with the Knit button. For more about R Notebooks, see this book by Xie and colleagues.

# This is an example of an R code chunk.
# For this session, please only edit code within R code chunks like this, focusing on the missing pieces shown with "___".




Step 1. Prepare for coding

Step 1a. Load relevant packages

For this session, we will mostly rely on the following packages:

  • tidyverse for importing and wrangling data
  • ggplot2 for graphing data
  • here for referencing files
  • zoo for calculating rolling averages
  • lubridate for working with dates

Make sure these packages are installed. If not, use the function install.packages() to download and install relevant packages.

Note, additional packages will be required for specific sections. We will introduce and load them when needed.
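If any of these packages are missing, a quick check-and-install pattern (package names copied from the list above) is:

```r
# check which of the required packages are missing, and install only those
pkgs <- c("tidyverse", "ggplot2", "here", "zoo", "lubridate")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```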

# load packages
library(tidyverse)
library(ggplot2)
library(here)
library(zoo)
library(lubridate)


Step 1b. Inspect working directory

It’s always good practice to inspect our working directory. This can be done using getwd(). We can also manually set our working directory using setwd().

here::here() is a nifty function that simplifies writing file paths. It also pairs well with R Projects. We will use it repeatedly in this session.

# inspect working directory
getwd()
## [1] "/Users/xiuqili/Documents/GitHub/DataClub_2022Fall/Session-1/code"
# inspect here::here()
here::here() 
## [1] "/Users/xiuqili/Documents/GitHub/DataClub_2022Fall"




Step 2. Create a line plot

In this part, we will recreate the New York Times “New reported cases” graph.

Step 2a. Import data from GitHub

We can use read_csv() to import data directly from GitHub.

# import data directly from GitHub
US_data <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv") 
## Rows: 973 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (2): cases, deaths
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# inspect data briefly
str(US_data)
## spec_tbl_df [973 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date  : Date[1:973], format: "2020-01-21" "2020-01-22" ...
##  $ cases : num [1:973] 1 1 1 2 3 5 5 5 5 6 ...
##  $ deaths: num [1:973] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   cases = col_double(),
##   ..   deaths = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(US_data)
##       date                cases              deaths       
##  Min.   :2020-01-21   Min.   :       1   Min.   :      0  
##  1st Qu.:2020-09-20   1st Qu.: 6825950   1st Qu.: 199367  
##  Median :2021-05-21   Median :33099318   Median : 588850  
##  Mean   :2021-05-21   Mean   :37659506   Mean   : 541669  
##  3rd Qu.:2022-01-19   3rd Qu.:68557295   3rd Qu.: 857644  
##  Max.   :2022-09-19   Max.   :95514998   Max.   :1050282


Step 2b. Wrangle data

The GitHub dataset we imported in Step 2a shows the daily number of cases and deaths nationwide. We will need to calculate:

  1. the number of new cases for each day
    • hint: dplyr::lag()
  2. the rolling 7-day average
    • hint: zoo::rollmeanr()


Step 2c. Graph data

The New York Times “New reported cases” graph shows:

  • 7-day average of new cases on the y-axis
  • dates on the x-axis.

We will use ggplot to recreate this graph.
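A minimal sketch of that plot, assuming the wrangled data frame and column names from Step 2b (US_daily, date, cases.daily.ave are assumptions; match them to your own code):

```r
# sketch: line plot of the 7-day average over time
ggplot(data = US_daily,
       mapping = aes(x = date, y = cases.daily.ave)) +
  geom_line(color = "firebrick") +
  theme_bw() +
  labs(title = "New reported COVID-19 cases in the US",
       subtitle = "7-day average of new cases",
       x = NULL, y = "Cases",
       caption = "Data source: New York Times. Retrieved from GitHub.")
```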




Step 3. Create a line plot using facet wrap

facet_wrap() and facet_grid() can be used to easily create multi-panel plots. In this section, we will use facet_wrap() to graph the rolling 7-day average of new COVID cases per 100k people for each state.

This will require several steps:

  1. Import COVID data from GitHub and population data from file
    • State COVID data posted by New York Times can be found here
  • State population data can be found in “data/population-us-2020-census.csv”. This dataset was accessed from the US Census Bureau. It has been pre-cleaned for you to simplify this part of the analysis.
  2. Combine the 2 datasets
    • note: New York Times COVID dataset does not contain data from the territories
  3. Calculate the 7-day rolling average of new cases for each state, and normalize by state population
    • hint: group_by() followed by mutate()
  4. Generate multi-panel graph
    • hint: facet_wrap()


For fun: the geofacet package works similarly to facet_wrap(), except it arranges the panels in a grid that mimics the original geographic topology. For more information on geofacet, see here.

# load library
library(geofacet)

# use facet_geo() instead of facet_wrap()
ggplot(data = State_daily_graph,
       mapping = aes(x = date, y = cases.daily.ave.density)) +
  geom_line(color = "grey45") +
  facet_geo(~ state) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  labs(title = "New reported COVID-19 cases in different states",
       subtitle = "Average daily cases per 100,000 people",
       caption = "Data source: New York Times. Retrieved from GitHub.")

# export graph
ggsave(filename = "State_AverageDailyCasesPer100k_GeoWrap.png", 
       path = here::here("Session-1", "graphs"),
       width = 14, height = 8, units = "in", bg = "white")




Step 4. Test a hypothesis and create an interactive plot

Let’s investigate the relationship between having a college/university in the county and masking rates within that county.

The New York Times posted a survey that was carried out in July 2020, during the second peak of the pandemic. Data from this survey were used to predict mask-wearing in different counties. Data and metadata are available on the New York Times GitHub. See here.

The us-colleges-and-universities.csv file in the data folder lists colleges, universities, and professional schools within the US. This .csv file was derived from this dataset on US Colleges and Universities. The us-colleges-and-universities.csv file has been simplified to speed up today’s analysis.

The county_fips.txt file in the data folder lists counties in the US, the states they belong to, and their FIPS codes. This was accessed from the USDA website.

For the analysis, we will need to:

  1. Calculate the rate of encountering someone wearing a mask in each county

    • for the purpose of this analysis, we can assume:
      • NEVER = 0% chance of wearing a mask
      • RARELY = 25% chance of wearing a mask
      • SOMETIMES = 50% chance of wearing a mask
      • FREQUENTLY = 75% chance of wearing a mask
      • ALWAYS = 100% chance of wearing a mask
    • Thus, the chance of encountering someone wearing a mask would be: NEVER * 0 + RARELY * 0.25 + SOMETIMES * 0.5 + FREQUENTLY * 0.75 + ALWAYS * 1
  2. Wrangle data and combine the datasets

  3. Create a violin + dot plot to compare masking rates in counties with colleges/universities to those without

  4. Perform t-test

  5. Create an interactive plot


Step 4a. Wrangle and visualize data (exploratory data analysis)
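A sketch of the violin + dot plot, assuming a combined data frame mask_data with the masking rate in chances and a Yes/No college.county indicator (all three names are assumptions; match them to your own wrangling):

```r
# sketch: compare masking rates in counties with vs. without colleges/universities
ggplot(data = mask_data,
       mapping = aes(x = college.county, y = chances)) +
  geom_violin(trim = FALSE) +
  geom_jitter(width = 0.1, alpha = 0.2, size = 0.5) +
  theme_bw() +
  labs(x = "County has a college/university",
       y = "Chance of encountering someone wearing a mask")
```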


Step 4b. Perform t-test

We can use t.test() to compare the two groups.

  • Hint: look at the documentation for t.test() to understand the assumptions made in this statistical analysis.
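The formula interface of t.test() compares a numeric variable across two groups. The sketch below uses a small made-up data frame so it runs on its own; in the session, replace made_up with the combined county data and use the chances and college.county variables:

```r
# sketch: Welch two-sample t-test via the formula interface, on made-up data
made_up <- data.frame(chances = c(0.71, 0.68, 0.74, 0.80, 0.83, 0.79),
                      college.county = rep(c("No", "Yes"), each = 3))
t.test(chances ~ college.county, data = made_up)
```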
## 
##  Welch Two Sample t-test
## 
## data:  chances by college.county
## t = -15.959, df = 1639.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -0.06998475 -0.05466467
## sample estimates:
##  mean in group No mean in group Yes 
##         0.7277745         0.7900992


We want to note here that Steps 4a and 4b are not meant to be a scientific analysis of COVID-related data. We are using this exercise to model the process of data wrangling and statistical analysis in R. Of note,

  • Correlation does not imply causation. Many other factors could be influencing the two variables we looked at.
  • We made multiple assumptions during our data analysis. These assumptions could all impact the outcome of the analysis.


Step 4c. Create an interactive plot

In this step, we will turn the violin/dot plot above into an interactive plot. This will require the package plotly. For more about plotly, specifically ggplotly() refer to this website.

We can display interactive plots as part of this R Markdown file, or export it as an independent html file. We will need the package htmlwidgets to export the interactive plot.
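A sketch of both steps, assuming p holds the ggplot violin/dot plot from Step 4a (the object and file names are assumptions):

```r
# sketch: convert a ggplot object to an interactive plot and export it
library(plotly)
library(htmlwidgets)

p_interactive <- ggplotly(p)   # convert the static ggplot to interactive
p_interactive                  # displays inline in the R Notebook

# export as a standalone html file
saveWidget(p_interactive, file = "masking_interactive.html")
```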




Step 5. Create an animated map

In this step, we will create an animated map showing the 7-day rolling average of new cases in each state (normalized to state population). We will start by creating a static map.


Step 5a. Create a static map

The maps package is a great source of geospatial data in R. It’s great for making some basic maps. For more on creating maps using the maps package, see this webpage.
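A minimal sketch: map_data() (from ggplot2, backed by the maps package) returns state outlines as a data frame of polygon vertices, which geom_polygon() can draw directly:

```r
# sketch: outline map of the lower 48 states
library(ggplot2)
library(maps)

us_states <- map_data("state")          # long/lat polygon vertices per state
ggplot(us_states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", color = "white") +
  coord_quickmap() +                    # keep an approximately correct aspect ratio
  theme_void()
```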


Step 5b. Create an animated graph using gganimate

We will now build upon the static map and create an animated map showing the 7-day rolling average of new cases in each state (normalized to state population).

The gganimate package builds upon the grammar of ggplot and provides functions for animation. For a gganimate cheat sheet, see here: https://ugoproto.github.io/ugo_r_doc/pdf/gganimate.pdf

We’ll use the package gifski to save the animated map we generated as a gif.
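A sketch of the animation step, assuming map_plot is the static map from Step 5a with fill mapped to the 7-day average and its data carrying a date column (the object and file names are assumptions):

```r
# sketch: animate the static map over time and save it as a gif
library(gganimate)
library(gifski)

animated_map <- map_plot +
  transition_time(date) +                       # one frame per date
  labs(subtitle = "Date: {frame_time}")         # show the current date on each frame

anim_save("State_AnimatedMap.gif",
          animate(animated_map, renderer = gifski_renderer()))
```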




Step 6. Wrap up

Congrats on reaching the last step! We hope this session has been interesting & informative.

Adding a date stamp and exporting the session info are great practices to ensure the reproducibility of your code. The next code chunk helps you do both.

# Add date stamp
lubridate::stamp("Data updated December 31, 1979")(lubridate::now())
## Multiple formats matched: "Data updated %Om %d, %Y"(1), "Data updated %B %d, %Y"(1)
## Using: "Data updated %B %d, %Y"
## [1] "Data updated September 20, 2022"
# Add session info
# note, the [-8] omits packages installed, but not loaded for this analysis
utils:::print.sessionInfo(sessionInfo()[-8]) 
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gifski_1.6.6-1    gganimate_1.0.8   maps_3.4.0        htmlwidgets_1.5.4
##  [5] plotly_4.10.0     geofacet_0.2.0    lubridate_1.8.0   zoo_1.8-11       
##  [9] here_1.0.1        forcats_0.5.2     stringr_1.4.1     dplyr_1.0.10     
## [13] purrr_0.3.4       readr_2.1.2       tidyr_1.2.1       tibble_3.1.8     
## [17] ggplot2_3.3.6     tidyverse_1.3.2


Rerunning everything at the very end also helps ensure that your code is reproducible. To do so:

  1. Clear objects from the workspace by going to Session > Clear Workspace… This will allow you to run the code in this file without interference from prior runs, thus testing the reproducibility of your code.

  2. Open the “Run” dropdown menu (upper right of the Script Editor), and click on “Run All”. This will run all the code in this document.

    • Note: if the code stops running before you hear the Mario tunes, an error is preventing the code from running to completion.
  3. Once you hear Mario tunes, click the “Save” button (upper left of Script Editor). This should automatically generate/update an html file that you can share with a colleague.



Time to celebrate with some Mario tunes!

# B/c Mario is awesome
library(beepr)
beep(sound = 8)